Introduction

This exercise explores both classification and regression problems, applying a range of models to predict Airbnb superhost status (classification) and nightly prices (regression). The goal is to compare different approaches, from traditional statistical methods to machine learning models, to understand their strengths and weaknesses in different descriptive and predictive tasks.

library(tidyverse)
library(dplyr)
library(readr)
library(lubridate)
library(DataExplorer)
library(mice)   
library(leaflet)
library(corrplot)       
library(GGally)         
library(gridExtra)
library(caret)
library(ggplot2)
library(ggeffects)
library(rpart)
library(rpart.plot)
library(pROC)    
library(pdp)
library(xgboost)
library(olsrr)

# for reproducibility
set.seed(123)

Data Cleaning & Processing

Data Selection & Cleaning

I will start by loading the data.

airbnb_data <- read_csv("listings.csv")

Next I will select the most useful columns for my analysis.

# select relevant columns for analysis
airbnb_reduced <- airbnb_data |> 
  dplyr::select(
    
    # target variables
    host_is_superhost,       # classification target
    price,                   # regression target
    
    # location
    neighbourhood_cleansed,  
    latitude,
    longitude,
    
    # property characteristics
    room_type,
    accommodates,
    bathrooms,
    bedrooms,
    beds,
    
    # host attributes
    host_response_time,
    host_response_rate,
    host_acceptance_rate,
    host_identity_verified,
    host_has_profile_pic,
    host_since,
    host_listings_count,
    
    # review metrics
    review_scores_rating,
    review_scores_cleanliness,
    review_scores_checkin,
    review_scores_communication,
    review_scores_location,
    review_scores_value,
    number_of_reviews,
    
    # availability and booking policies
    minimum_nights,
    maximum_nights,
    availability_365,
    instant_bookable
    )

Then I will clean and preprocess the variables, converting them to numeric or factor types where necessary.

# clean and preprocess variables
airbnb_clean <- airbnb_reduced |> 
  mutate(
    # convert price from string to numeric (remove $ and commas)
    price = as.numeric(gsub("[\\$,]", "", price)),
    
    # convert categorical variables to factors as specified
    neighbourhood_cleansed = factor(neighbourhood_cleansed),
    room_type = factor(room_type),
    
    # convert response time to ordered factor
    host_response_time = factor(host_response_time, 
                              levels = c("within an hour", "within a few hours", 
                                         "within a day", "a few days or more")),
    
    # convert rates from percentage strings to numeric as requested
    host_response_rate = as.numeric(gsub("[\\%]", "", host_response_rate)) / 100,
    host_acceptance_rate = as.numeric(gsub("[\\%]", "", host_acceptance_rate)) / 100,
    
    # convert logical variables to factors (NA values stay as NA)
    host_is_superhost = factor(host_is_superhost, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    host_identity_verified = factor(host_identity_verified, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    host_has_profile_pic = factor(host_has_profile_pic, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    instant_bookable = factor(instant_bookable, levels = c(FALSE, TRUE), labels = c("No", "Yes")),
    
    # convert host_since to days from today as requested
    host_since = as.Date(host_since),
    host_tenure_days = as.numeric(difftime(Sys.Date(), host_since, units = "days")),
    
    # cap minimum and maximum nights at 365 (in line with airbnb policy)
    maximum_nights = pmin(maximum_nights, 365),
    minimum_nights = pmin(minimum_nights, 365)
  )
  
# remove the original host_since column as we now have host_tenure_days
airbnb_clean <- airbnb_clean |>
  dplyr::select(-host_since)
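
As a quick sanity check of the price-parsing step, the same regex can be applied to a few hand-made strings (a minimal sketch, not the listings file):

```r
# hedged sanity check on hand-made price strings (not the real data)
sample_prices <- c("$1,250.00", "$95.00", "$48")
parsed <- as.numeric(gsub("[\\$,]", "", sample_prices))  # strip $ and commas
parsed  # 1250 95 48
```

Both the dollar sign and the thousands separator are removed before the numeric conversion, so values like "$1,250.00" parse correctly rather than becoming NA.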

# check the data structure and types
str(airbnb_clean)
## tibble [7,863 × 28] (S3: tbl_df/tbl/data.frame)
##  $ host_is_superhost          : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 NA 1 2 ...
##  $ price                      : num [1:7863] 140 360 NA 48 NA 149 48 52 95 NA ...
##  $ neighbourhood_cleansed     : Factor w/ 37 levels "Bayview","Bernal Heights",..: 18 37 19 22 22 19 30 14 13 12 ...
##  $ latitude                   : num [1:7863] 37.8 37.8 37.8 37.7 37.7 ...
##  $ longitude                  : num [1:7863] -122 -122 -122 -122 -122 ...
##  $ room_type                  : Factor w/ 4 levels "Entire home/apt",..: 1 1 3 3 1 1 3 3 1 3 ...
##  $ accommodates               : num [1:7863] 4 4 1 1 2 4 2 4 2 1 ...
##  $ bathrooms                  : num [1:7863] 1 1 NA 1 NA 1 1 0 1 NA ...
##  $ bedrooms                   : num [1:7863] 2 2 NA 1 NA 2 1 1 1 NA ...
##  $ beds                       : num [1:7863] 2 2 NA 1 NA 2 1 1 1 NA ...
##  $ host_response_time         : Factor w/ 4 levels "within an hour",..: 1 1 NA 2 NA 4 4 1 1 1 ...
##  $ host_response_rate         : num [1:7863] 0.88 1 NA 1 NA 0.33 0 1 1 1 ...
##  $ host_acceptance_rate       : num [1:7863] 0.99 0.95 NA NA NA 0 0 0.93 0.99 0.92 ...
##  $ host_identity_verified     : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ host_has_profile_pic       : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ host_listings_count        : num [1:7863] 19 5 1 1 1 1 1 38 43 19 ...
##  $ review_scores_rating       : num [1:7863] 4.5 NA 5 NA NA NA NA NA 5 2.5 ...
##  $ review_scores_cleanliness  : num [1:7863] 4.25 NA 5 NA NA NA NA NA 4.78 3 ...
##  $ review_scores_checkin      : num [1:7863] 5 NA 5 NA NA NA NA NA 5 3.5 ...
##  $ review_scores_communication: num [1:7863] 4.75 NA 5 NA NA NA NA NA 5 3.5 ...
##  $ review_scores_location     : num [1:7863] 4.5 NA 5 NA NA NA NA NA 4.89 4.5 ...
##  $ review_scores_value        : num [1:7863] 4.25 NA 5 NA NA NA NA NA 4.67 2 ...
##  $ number_of_reviews          : num [1:7863] 4 0 1 0 0 0 0 0 9 2 ...
##  $ minimum_nights             : num [1:7863] 30 30 30 30 30 30 30 30 30 32 ...
##  $ maximum_nights             : num [1:7863] 365 365 75 365 365 60 365 365 365 90 ...
##  $ availability_365           : num [1:7863] 41 277 0 364 0 54 89 365 305 365 ...
##  $ instant_bookable           : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 2 2 1 ...
##  $ host_tenure_days           : num [1:7863] 5030 1664 4757 525 3157 ...

Handling Missing Data

plot_intro(airbnb_clean)

# non-numeric variables
names(airbnb_clean)[!sapply(airbnb_clean, is.numeric)]
## [1] "host_is_superhost"      "neighbourhood_cleansed" "room_type"             
## [4] "host_response_time"     "host_identity_verified" "host_has_profile_pic"  
## [7] "instant_bookable"

I can see that there are quite a few rows with missing observations, so I will impute the missing values with MICE (multivariate imputation by chained equations).

pred_matrix <- make.predictorMatrix(airbnb_clean)

imputation_methods <- make.method(airbnb_clean)
# use PMM for all numeric variables
imputation_methods[c("price", "accommodates", "bathrooms", "bedrooms", "beds", 
                    "host_response_rate", "host_acceptance_rate", "host_listings_count",
                    "review_scores_rating", "review_scores_cleanliness", "review_scores_checkin", 
                    "review_scores_communication", "review_scores_location", "review_scores_value",
                    "number_of_reviews", "minimum_nights", "maximum_nights", 
                    "availability_365", "host_tenure_days")] <- "pmm"

# run MICE with explicit methods
mice_mod <- mice(airbnb_clean, m = 4, seed = 123, method = imputation_methods, 
                predictorMatrix = pred_matrix, print = FALSE)
## Warning: Number of logged events: 400
# get the completed dataset
airbnb_clean <- complete(mice_mod, 1)

# check types after imputation
str(airbnb_clean$price)
##  num [1:7863] 140 360 225 48 184 149 48 52 95 255 ...

Now I have a clean dataset that is ready for analysis.

EDA

Visualizing Target Variables

I will use host_is_superhost (T/F) as my classification target and price as my regression target.

# simple visualization of superhost status (classification target)
ggplot(airbnb_clean, aes(x = host_is_superhost, fill = host_is_superhost)) +
  geom_bar() +
  labs(title = "Distribution of Superhost Status") +
  theme_minimal()

# simple visualization of price distribution (regression target)
ggplot(airbnb_clean, aes(x = price)) +
  geom_histogram(fill = "steelblue", bins = 50) +
  scale_x_continuous(limits = c(0, 500)) +
  labs(title = "Distribution of Airbnb Listing Prices",
       x = "Price ($)",
       y = "Count") +
  theme_minimal()

# relationship between target variables (price by superhost status)
ggplot(airbnb_clean, aes(x = host_is_superhost, y = price, fill = host_is_superhost)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 500)) +
  labs(title = "Price by Superhost Status") +
  theme_minimal()

I can see the classification target is reasonably balanced, with about 4,500 “No” and 3,300 “Yes” instances. Price is right-skewed, with most listings priced around $100/night. While the median price is nearly identical for superhosts and non-superhosts, superhosts have more high-end outliers, suggesting a larger share of premium-priced listings.

Numeric Variable Distributions

#plot histograms
df_long <- airbnb_clean |> 
  dplyr::select(where(is.numeric)) |> 
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value")

ggplot(df_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  facet_wrap(~feature, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Features")

Log Transformations for Linear Models

# create a transformed version of the dataset for models that need it
airbnb_final <- airbnb_clean |> 
  mutate(
    # log transformations for heavily skewed variables
    host_listings_count_log = log1p(host_listings_count),
    number_of_reviews_log = log1p(number_of_reviews),
    minimum_nights_log = log1p(minimum_nights),
    maximum_nights_log = log1p(maximum_nights),
    # I will also add some simple coordinate-derived features here
    coords_squared = latitude^2 + longitude^2,
    coords_mult = latitude * longitude
  )

I applied log transformations to heavily right-skewed variables (host_listings_count, number_of_reviews, minimum_nights, maximum_nights) to normalize their distributions, which benefits linear models and distance-based algorithms.
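
The effect of log1p() on a right-skewed variable can be illustrated on simulated lognormal data (a minimal sketch, not the listings themselves), using a hand-rolled moment-based skewness measure:

```r
# hedged sketch: log1p reduces right skew (simulated data, not the listings file)
set.seed(123)
x <- rlnorm(1000, meanlog = 2, sdlog = 1)  # heavily right-skewed draws

# sample skewness: third central moment over sd cubed
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew(x)         # strongly positive
skew(log1p(x))  # much closer to zero
```

log1p() is used rather than log() because several of these variables (e.g. number_of_reviews) contain zeros, where log() would return -Inf.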

Geographic Variables

I will also visualize the relationship between geographic variables (latitude and longitude) and my regression target, price.

# cap price at $500
airbnb_final$price_capped <- pmin(airbnb_final$price, 500)

# create a color palette based on the capped price
color_pal <- colorNumeric(palette = "RdYlBu", domain = airbnb_final$price_capped, reverse = TRUE)

# create a map showing price distribution with capped prices
map <- leaflet(airbnb_final) |> 
  addProviderTiles(providers$CartoDB.Positron) |> 
  setView(lng = median(airbnb_final$longitude, na.rm = TRUE), 
          lat = median(airbnb_final$latitude, na.rm = TRUE), 
          zoom = 12) |> 
  addCircles(
    lng = ~longitude, 
    lat = ~latitude,
    radius = ~sqrt(price_capped)*5,  
    color = ~color_pal(price_capped), 
    popup = ~paste(
      "Price: $", price_capped, "<br>",
      "Room type: ", room_type, "<br>",
      "Accommodates: ", accommodates
    )
  ) |> 
  addLegend(position = "bottomright", 
            pal = color_pal, 
            values = ~price_capped, 
            title = "Price ($, capped at 500)")
map

Higher-priced listings appear to be clustered in specific areas, especially in neighborhoods in the northeastern part of the city, while lower-priced options are more widely dispersed.

Correlations

ggcorr(airbnb_clean, label = T)
## Warning in ggcorr(airbnb_clean, label = T): data in column(s)
## 'host_is_superhost', 'neighbourhood_cleansed', 'room_type',
## 'host_response_time', 'host_identity_verified', 'host_has_profile_pic',
## 'instant_bookable' are not numeric and were ignored

There is moderate to high correlation between review variables and size variables (accommodates, bathrooms, bedrooms). While this is not a problem for ML/tree-based models, I may need to consider regularization techniques to prevent multicollinearity issues when using linear models.
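
One way to quantify this concern is the variance inflation factor, which can be computed by hand as 1 / (1 - R²) from regressing each predictor on the others. The sketch below uses simulated size variables rather than the real data (on a fitted model, car::vif() returns the same quantity):

```r
# hedged sketch: manual VIFs on simulated, deliberately collinear size variables
set.seed(123)
n <- 500
bedrooms     <- rpois(n, 2)
beds         <- bedrooms + rpois(n, 1)      # built to correlate with bedrooms
accommodates <- 2 * bedrooms + rpois(n, 2)  # likewise
d <- data.frame(accommodates, bedrooms, beds)

vif_manual <- sapply(names(d), function(v) {
  # R^2 from regressing variable v on all the other columns
  r2 <- summary(lm(reformulate(setdiff(names(d), v), response = v), data = d))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)  # values above ~5 are a common rule-of-thumb red flag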

Classification (Interpretation)

Data Splitting

For model validation, I’m using a 70/30 split between training and testing data. With over 7,700 observations in the dataset, this gives us sufficient data for both training the models and evaluating their performance on unseen data.

in_train <- createDataPartition(airbnb_final$host_is_superhost, p = 0.7, list = FALSE)  # 70% for training, my choice
training <- airbnb_final[ in_train,]
testing <- airbnb_final[-in_train,]
nrow(training)
## [1] 5505
nrow(testing)
## [1] 2358

I’ve also configured 5-fold cross-validation for model training. Given the dataset size, 5 folds offers a good balance between computational efficiency and validation quality. This approach trains each model on different subsets of the training data, providing more reliable performance estimates than a single train/test split alone.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, # my choice since the dataset is somewhat large; with more time I would use 10 folds
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     verboseIter = FALSE) # keeps the knitted output readable

Logistic (Binary) Regression

This wouldn’t be feasible to interpret with hundreds of variables, but since I only have a few dozen I’ll run a logistic regression with all of them to identify significant predictors of superhost status.

# train a logistic regression model with all variables
logit_model <- glm(host_is_superhost ~ ., family = binomial, data = training)
summary(logit_model)
## 
## Call:
## glm(formula = host_is_superhost ~ ., family = binomial, data = training)
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                  2.107e+05  1.633e+06   0.129
## price                                        1.709e-05  4.368e-05   0.391
## neighbourhood_cleansedBernal Heights         3.768e-01  4.618e-01   0.816
## neighbourhood_cleansedCastro/Upper Market    1.031e+00  6.390e-01   1.613
## neighbourhood_cleansedChinatown              1.195e+00  7.618e-01   1.569
## neighbourhood_cleansedCrocker Amazon         1.188e+00  6.878e-01   1.727
## neighbourhood_cleansedDiamond Heights        7.934e-01  1.049e+00   0.756
## neighbourhood_cleansedDowntown/Civic Center  7.381e-01  6.936e-01   1.064
## neighbourhood_cleansedExcelsior              4.831e-01  4.543e-01   1.063
## neighbourhood_cleansedFinancial District     5.965e-01  7.418e-01   0.804
## neighbourhood_cleansedGlen Park              4.867e-01  6.381e-01   0.763
## neighbourhood_cleansedGolden Gate Park       1.227e+00  9.822e-01   1.249
## neighbourhood_cleansedHaight Ashbury         1.193e+00  6.902e-01   1.728
## neighbourhood_cleansedInner Richmond         1.490e+00  7.371e-01   2.022
## neighbourhood_cleansedInner Sunset           1.723e+00  7.278e-01   2.367
## neighbourhood_cleansedLakeshore              1.197e+00  8.878e-01   1.349
## neighbourhood_cleansedMarina                 1.931e+00  7.854e-01   2.459
## neighbourhood_cleansedMission                9.106e-01  5.425e-01   1.679
## neighbourhood_cleansedNob Hill               7.290e-01  7.433e-01   0.981
## neighbourhood_cleansedNoe Valley             8.532e-01  5.909e-01   1.444
## neighbourhood_cleansedNorth Beach            1.530e+00  8.228e-01   1.859
## neighbourhood_cleansedOcean View             4.850e-01  7.333e-01   0.661
## neighbourhood_cleansedOuter Mission          2.652e-01  6.072e-01   0.437
## neighbourhood_cleansedOuter Richmond         2.283e+00  7.779e-01   2.935
## neighbourhood_cleansedOuter Sunset           2.075e+00  7.530e-01   2.756
## neighbourhood_cleansedPacific Heights        1.328e+00  7.551e-01   1.759
## neighbourhood_cleansedParkside               1.729e+00  8.048e-01   2.148
## neighbourhood_cleansedPotrero Hill           3.491e-01  5.135e-01   0.680
## neighbourhood_cleansedPresidio               1.845e+00  1.069e+00   1.726
## neighbourhood_cleansedPresidio Heights       2.785e+00  9.202e-01   3.026
## neighbourhood_cleansedRussian Hill           1.663e+00  8.015e-01   2.075
## neighbourhood_cleansedSeacliff               1.262e+00  1.171e+00   1.078
## neighbourhood_cleansedSouth of Market        8.911e-01  6.396e-01   1.393
## neighbourhood_cleansedTreasure Island/YBI   -1.226e+01  2.400e+03  -0.005
## neighbourhood_cleansedTwin Peaks             1.275e+00  7.602e-01   1.677
## neighbourhood_cleansedVisitacion Valley      8.420e-01  4.876e-01   1.727
## neighbourhood_cleansedWest of Twin Peaks     1.174e+00  7.112e-01   1.651
## neighbourhood_cleansedWestern Addition       1.773e+00  6.843e-01   2.590
## latitude                                     9.477e+02  2.435e+04   0.039
## longitude                                    3.713e+03  2.057e+04   0.180
## room_typeHotel room                         -1.519e+01  2.142e+02  -0.071
## room_typePrivate room                       -5.303e-01  9.132e-02  -5.808
## room_typeShared room                        -2.108e+00  4.989e-01  -4.226
## accommodates                                 3.770e-02  3.558e-02   1.060
## bathrooms                                    1.777e-01  6.313e-02   2.815
## bedrooms                                    -4.485e-02  6.542e-02  -0.686
## beds                                        -3.272e-02  5.012e-02  -0.653
## host_response_timewithin a few hours        -1.550e-01  1.023e-01  -1.514
## host_response_timewithin a day              -4.962e-01  1.591e-01  -3.119
## host_response_timea few days or more        -7.590e-02  1.230e+00  -0.062
## host_response_rate                           4.098e+00  8.120e-01   5.046
## host_acceptance_rate                         2.330e+00  2.275e-01  10.242
## host_identity_verifiedYes                    4.377e-01  1.253e-01   3.493
## host_has_profile_picYes                      1.542e-01  4.324e-01   0.357
## host_listings_count                         -3.519e-03  5.035e-04  -6.990
## review_scores_rating                         2.652e+00  3.020e-01   8.780
## review_scores_cleanliness                    5.378e-01  1.979e-01   2.717
## review_scores_checkin                        5.713e-02  2.560e-01   0.223
## review_scores_communication                  1.980e-01  2.580e-01   0.767
## review_scores_location                       4.535e-01  1.649e-01   2.750
## review_scores_value                         -9.916e-02  1.914e-01  -0.518
## number_of_reviews                           -1.854e-03  4.688e-04  -3.955
## minimum_nights                              -8.645e-03  3.506e-03  -2.466
## maximum_nights                               3.235e-03  7.350e-04   4.401
## availability_365                            -3.561e-04  2.849e-04  -1.250
## instant_bookableYes                         -6.567e-01  8.980e-02  -7.313
## host_tenure_days                            -3.526e-05  2.855e-05  -1.235
## host_listings_count_log                      4.669e-01  4.217e-02  11.072
## number_of_reviews_log                        5.823e-01  3.305e-02  17.620
## minimum_nights_log                           1.275e-02  6.228e-02   0.205
## maximum_nights_log                          -3.570e-01  9.197e-02  -3.882
## coords_squared                               1.802e+01  7.100e+01   0.254
## coords_mult                                  1.897e+01  1.808e+02   0.105
## price_capped                                -1.362e-03  4.121e-04  -3.305
##                                             Pr(>|z|)    
## (Intercept)                                 0.897358    
## price                                       0.695654    
## neighbourhood_cleansedBernal Heights        0.414488    
## neighbourhood_cleansedCastro/Upper Market   0.106685    
## neighbourhood_cleansedChinatown             0.116642    
## neighbourhood_cleansedCrocker Amazon        0.084097 .  
## neighbourhood_cleansedDiamond Heights       0.449543    
## neighbourhood_cleansedDowntown/Civic Center 0.287282    
## neighbourhood_cleansedExcelsior             0.287621    
## neighbourhood_cleansedFinancial District    0.421314    
## neighbourhood_cleansedGlen Park             0.445628    
## neighbourhood_cleansedGolden Gate Park      0.211652    
## neighbourhood_cleansedHaight Ashbury        0.084020 .  
## neighbourhood_cleansedInner Richmond        0.043207 *  
## neighbourhood_cleansedInner Sunset          0.017925 *  
## neighbourhood_cleansedLakeshore             0.177410    
## neighbourhood_cleansedMarina                0.013927 *  
## neighbourhood_cleansedMission               0.093233 .  
## neighbourhood_cleansedNob Hill              0.326698    
## neighbourhood_cleansedNoe Valley            0.148822    
## neighbourhood_cleansedNorth Beach           0.063011 .  
## neighbourhood_cleansedOcean View            0.508318    
## neighbourhood_cleansedOuter Mission         0.662298    
## neighbourhood_cleansedOuter Richmond        0.003338 ** 
## neighbourhood_cleansedOuter Sunset          0.005853 ** 
## neighbourhood_cleansedPacific Heights       0.078650 .  
## neighbourhood_cleansedParkside              0.031727 *  
## neighbourhood_cleansedPotrero Hill          0.496596    
## neighbourhood_cleansedPresidio              0.084329 .  
## neighbourhood_cleansedPresidio Heights      0.002474 ** 
## neighbourhood_cleansedRussian Hill          0.037992 *  
## neighbourhood_cleansedSeacliff              0.281120    
## neighbourhood_cleansedSouth of Market       0.163599    
## neighbourhood_cleansedTreasure Island/YBI   0.995922    
## neighbourhood_cleansedTwin Peaks            0.093492 .  
## neighbourhood_cleansedVisitacion Valley     0.084213 .  
## neighbourhood_cleansedWest of Twin Peaks    0.098740 .  
## neighbourhood_cleansedWestern Addition      0.009592 ** 
## latitude                                    0.968954    
## longitude                                   0.856768    
## room_typeHotel room                         0.943461    
## room_typePrivate room                       6.34e-09 ***
## room_typeShared room                        2.38e-05 ***
## accommodates                                0.289330    
## bathrooms                                   0.004876 ** 
## bedrooms                                    0.492937    
## beds                                        0.513913    
## host_response_timewithin a few hours        0.129982    
## host_response_timewithin a day              0.001816 ** 
## host_response_timea few days or more        0.950798    
## host_response_rate                          4.50e-07 ***
## host_acceptance_rate                         < 2e-16 ***
## host_identity_verifiedYes                   0.000477 ***
## host_has_profile_picYes                     0.721384    
## host_listings_count                         2.76e-12 ***
## review_scores_rating                         < 2e-16 ***
## review_scores_cleanliness                   0.006591 ** 
## review_scores_checkin                       0.823389    
## review_scores_communication                 0.442835    
## review_scores_location                      0.005965 ** 
## review_scores_value                         0.604405    
## number_of_reviews                           7.66e-05 ***
## minimum_nights                              0.013671 *  
## maximum_nights                              1.08e-05 ***
## availability_365                            0.211341    
## instant_bookableYes                         2.61e-13 ***
## host_tenure_days                            0.216852    
## host_listings_count_log                      < 2e-16 ***
## number_of_reviews_log                        < 2e-16 ***
## minimum_nights_log                          0.837736    
## maximum_nights_log                          0.000104 ***
## coords_squared                              0.799644    
## coords_mult                                 0.916417    
## price_capped                                0.000949 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7479.4  on 5504  degrees of freedom
## Residual deviance: 5064.8  on 5431  degrees of freedom
## AIC: 5212.8
## 
## Number of Fisher Scoring iterations: 15

This model suggests that host responsiveness, acceptance rate, review volume and quality, portfolio size, and booking policies are the strongest predictors of superhost status. I’ll retain these key variables in a smaller, more interpretable model.

# create a more interpretable model with selected variables
logit_simple <- glm(
  host_is_superhost ~ host_response_rate + 
                     host_acceptance_rate + 
                     review_scores_rating + 
                     number_of_reviews_log + 
                     host_listings_count_log + 
                     availability_365 + 
                     instant_bookable +
                     room_type +
                     minimum_nights_log +
                     maximum_nights_log,
  family = binomial, 
  data = training
)
summary(logit_simple)
## 
## Call:
## glm(formula = host_is_superhost ~ host_response_rate + host_acceptance_rate + 
##     review_scores_rating + number_of_reviews_log + host_listings_count_log + 
##     availability_365 + instant_bookable + room_type + minimum_nights_log + 
##     maximum_nights_log, family = binomial, data = training)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.179e+01  1.212e+00 -17.976  < 2e-16 ***
## host_response_rate       4.206e+00  6.741e-01   6.239 4.41e-10 ***
## host_acceptance_rate     2.524e+00  2.023e-01  12.475  < 2e-16 ***
## review_scores_rating     2.954e+00  2.016e-01  14.649  < 2e-16 ***
## number_of_reviews_log    4.807e-01  2.285e-02  21.036  < 2e-16 ***
## host_listings_count_log  6.545e-02  2.008e-02   3.260 0.001116 ** 
## availability_365        -8.098e-04  2.639e-04  -3.069 0.002150 ** 
## instant_bookableYes     -6.393e-01  8.168e-02  -7.826 5.02e-15 ***
## room_typeHotel room     -1.464e+01  1.522e+02  -0.096 0.923354    
## room_typePrivate room   -1.474e-01  7.239e-02  -2.037 0.041690 *  
## room_typeShared room    -1.755e+00  4.679e-01  -3.751 0.000176 ***
## minimum_nights_log      -7.095e-02  3.147e-02  -2.254 0.024170 *  
## maximum_nights_log       1.439e-02  3.028e-02   0.475 0.634763    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7479.4  on 5504  degrees of freedom
## Residual deviance: 5503.0  on 5492  degrees of freedom
## AIC: 5529
## 
## Number of Fisher Scoring iterations: 14
# calculate odds ratios for easier interpretation
odds_ratios <- exp(coef(logit_simple))
odds_ratios_df <- data.frame(
  Variable = names(odds_ratios),
  OddsRatio = odds_ratios
)
print(odds_ratios_df)
##                                        Variable    OddsRatio
## (Intercept)                         (Intercept) 3.456411e-10
## host_response_rate           host_response_rate 6.705480e+01
## host_acceptance_rate       host_acceptance_rate 1.247386e+01
## review_scores_rating       review_scores_rating 1.917780e+01
## number_of_reviews_log     number_of_reviews_log 1.617153e+00
## host_listings_count_log host_listings_count_log 1.067645e+00
## availability_365               availability_365 9.991905e-01
## instant_bookableYes         instant_bookableYes 5.276675e-01
## room_typeHotel room         room_typeHotel room 4.391720e-07
## room_typePrivate room     room_typePrivate room 8.629281e-01
## room_typeShared room       room_typeShared room 1.728552e-01
## minimum_nights_log           minimum_nights_log 9.315070e-01
## maximum_nights_log           maximum_nights_log 1.014490e+00

This simplified model reveals clear superhost predictors: responsive communication is paramount, followed by high ratings and acceptance rate. Having more reviews significantly increases superhost probability, while shared rooms reduce chances. Properties with lower availability and without instant booking are more likely to achieve superhost status, suggesting selective, attentive hosting is more effective than maximizing bookings. These results are visualized in the plots below:
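
To make an odds ratio concrete, it can be converted into an implied probability shift. The sketch below uses the ~0.53 odds ratio for instant_bookable from the table above, applied at an assumed (illustrative) 40% baseline probability:

```r
# hedged sketch: translating an odds ratio into a probability shift
or_to_prob <- function(p_base, or) {
  odds <- p_base / (1 - p_base) * or  # scale the baseline odds by the odds ratio
  odds / (1 + odds)                   # convert the new odds back to a probability
}

# instant booking's ~0.53 odds ratio at an assumed 40% baseline:
or_to_prob(0.40, 0.53)  # ~0.26, i.e. a roughly 14-point drop in probability
```

Because the odds-to-probability mapping is nonlinear, the same odds ratio implies different probability changes at different baselines, which is why the marginal-effect plots below are more informative than the raw coefficients.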

# host response rate effect
gg_response <- ggpredict(logit_simple, terms = "host_response_rate")
## Data were 'prettified'. Consider using `terms="host_response_rate
##   [all]"` to get smooth plots.
plot(gg_response) + 
  labs(title = "Effect of Host Response Rate on Superhost Probability",
       x = "Host Response Rate",
       y = "Probability of being a Superhost")

gg_response
## # Predicted probabilities of host_is_superhost
## 
## host_response_rate | Predicted |     95% CI
## -------------------------------------------
##               0.00 |      0.01 | 0.00, 0.04
##               0.10 |      0.02 | 0.01, 0.05
##               0.30 |      0.04 | 0.02, 0.09
##               0.40 |      0.06 | 0.03, 0.12
##               0.50 |      0.09 | 0.05, 0.15
##               0.60 |      0.12 | 0.08, 0.19
##               0.70 |      0.18 | 0.13, 0.24
##               1.00 |      0.43 | 0.41, 0.46
## 
## Adjusted for:
## *    host_acceptance_rate =            0.85
## *    review_scores_rating =            4.78
## *   number_of_reviews_log =            2.33
## * host_listings_count_log =            2.33
## *        availability_365 =          186.29
## *        instant_bookable =              No
## *               room_type = Entire home/apt
## *      minimum_nights_log =            2.13
## *      maximum_nights_log =            4.89
## 
## Not all rows are shown in the output. Use `print(..., n = Inf)` to show
##   all rows.
# review scores rating effect
gg_rating <- ggpredict(logit_simple, terms = "review_scores_rating")
## Data were 'prettified'. Consider using `terms="review_scores_rating
##   [all]"` to get smooth plots.
plot(gg_rating) + 
  labs(title = "Effect of Review Scores Rating on Superhost Probability",
       x = "Review Scores Rating",
       y = "Probability of being a Superhost")

gg_rating
## # Predicted probabilities of host_is_superhost
## 
## review_scores_rating | Predicted |     95% CI
## ---------------------------------------------
##                    1 |      0.00 | 0.00, 0.00
##                    2 |      0.00 | 0.00, 0.00
##                    3 |      0.00 | 0.00, 0.01
##                    4 |      0.06 | 0.04, 0.09
##                    5 |      0.56 | 0.53, 0.58
## 
## Adjusted for:
## *      host_response_rate =            0.97
## *    host_acceptance_rate =            0.85
## *   number_of_reviews_log =            2.33
## * host_listings_count_log =            2.33
## *        availability_365 =          186.29
## *        instant_bookable =              No
## *               room_type = Entire home/apt
## *      minimum_nights_log =            2.13
## *      maximum_nights_log =            4.89
# availability effect
gg_avail <- ggpredict(logit_simple, terms = "availability_365")
## Data were 'prettified'. Consider using `terms="availability_365 [all]"`
##   to get smooth plots.
plot(gg_avail) + 
  labs(title = "Effect of Availability on Superhost Probability",
       x = "Days Available per Year",
       y = "Probability of being a Superhost")

gg_avail
## # Predicted probabilities of host_is_superhost
## 
## availability_365 | Predicted |     95% CI
## -----------------------------------------
##                0 |      0.44 | 0.40, 0.47
##               40 |      0.43 | 0.40, 0.46
##              100 |      0.42 | 0.39, 0.44
##              140 |      0.41 | 0.38, 0.43
##              180 |      0.40 | 0.38, 0.43
##              240 |      0.39 | 0.36, 0.41
##              280 |      0.38 | 0.35, 0.41
##              380 |      0.36 | 0.33, 0.40
## 
## Adjusted for:
## *      host_response_rate =            0.97
## *    host_acceptance_rate =            0.85
## *    review_scores_rating =            4.78
## *   number_of_reviews_log =            2.33
## * host_listings_count_log =            2.33
## *        instant_bookable =              No
## *               room_type = Entire home/apt
## *      minimum_nights_log =            2.13
## *      maximum_nights_log =            4.89
## 
## Not all rows are shown in the output. Use `print(..., n = Inf)` to show
##   all rows.
# number of reviews (log) effect
gg_reviews <- ggpredict(logit_simple, terms = "number_of_reviews_log")
## Data were 'prettified'. Consider using `terms="number_of_reviews_log
##   [all]"` to get smooth plots.
plot(gg_reviews) + 
  labs(title = "Effect of Number of Reviews (log) on Superhost Probability",
       x = "Log Number of Reviews",
       y = "Probability of being a Superhost")

gg_reviews
## # Predicted probabilities of host_is_superhost
## 
## number_of_reviews_log | Predicted |     95% CI
## ----------------------------------------------
##                     0 |      0.18 | 0.16, 0.20
##                     1 |      0.26 | 0.24, 0.28
##                     2 |      0.36 | 0.34, 0.39
##                     3 |      0.48 | 0.45, 0.50
##                     4 |      0.60 | 0.57, 0.62
##                     5 |      0.71 | 0.67, 0.73
##                     6 |      0.79 | 0.76, 0.82
##                     7 |      0.86 | 0.83, 0.89
## 
## Adjusted for:
## *      host_response_rate =            0.97
## *    host_acceptance_rate =            0.85
## *    review_scores_rating =            4.78
## * host_listings_count_log =            2.33
## *        availability_365 =          186.29
## *        instant_bookable =              No
## *               room_type = Entire home/apt
## *      minimum_nights_log =            2.13
## *      maximum_nights_log =            4.89

Decision Tree

While decision trees do not have the strongest predictive power, they are useful explanatory tools. Let’s try this model and compare the results with the model above.

# create parameters for interpretability
control = rpart.control(minsplit = 30, cp = 0.02)

#create decision tree
dt_simple <- rpart(host_is_superhost ~ ., data = training, method = "class", control = control)
rpart.plot(dt_simple, digits = 3)

The decision tree identifies three key thresholds for achieving superhost status. Listings with fewer than four reviews are unlikely to belong to a superhost. For listings with enough reviews, the overall rating becomes decisive: lower-rated listings have a much reduced likelihood of superhost status. Host acceptance rate also plays a significant role, with higher rates increasing the probability of earning superhost status. These decision rules align with the logistic regression findings and offer actionable benchmarks for Airbnb hosts.
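The cut points displayed in the plot can also be read straight out of the fitted object. A minimal sketch on the kyphosis data that ships with rpart (not the Airbnb data); the control values mirror the ones used above:

```r
library(rpart)  # recommended package, ships with R

# fit a small classification tree, as in the chunk above
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(minsplit = 30, cp = 0.02))

# fit$splits holds one row per split considered; the "index" column
# contains the numeric thresholds the plot displays
fit$splits[, "index"]
```

This is handy when you want to report the exact decision rules rather than eyeball them from the diagram.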

Classification (Prediction)

Benchmark Model

I’m starting with a simple benchmark model to establish a baseline for prediction. This model only uses the overall proportion of superhosts in my training data to make predictions, without considering any listing features.

# create benchmark model
bench_model = glm(host_is_superhost ~ 1, family = binomial(link = 'logit'), data = training)

# get predictions from benchmark model
probability = predict(bench_model, newdata = testing, type = "response")

# check probability distribution
summary(probability)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4171  0.4171  0.4171  0.4171  0.4171  0.4171
# create factor predictions
prediction = factor(ifelse(probability > 0.5, "Yes", "No"), levels = c("No", "Yes"))

# confusion matrix
confusionMatrix(prediction, testing$host_is_superhost)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1375  983
##        Yes    0    0
##                                           
##                Accuracy : 0.5831          
##                  95% CI : (0.5629, 0.6031)
##     No Information Rate : 0.5831          
##     P-Value [Acc > NIR] : 0.5088          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.5831          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.5831          
##          Detection Rate : 0.5831          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : No              
## 

The summary shows that my model assigns the exact same probability (0.4171, or 41.71%) to every listing in the test set: the proportion of superhosts in my training data. Since this probability is below the 0.5 threshold, the model predicts “No” for every single observation. This yields an accuracy of 58.31%, which is exactly the share of non-superhosts in my test data. While the model correctly identifies all the “No” cases (100% sensitivity, since “No” is treated as the positive class here), it fails to identify any “Yes” cases (0% specificity), giving a Kappa value of 0.
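This behavior can be verified on a toy dataset: the fitted probability of an intercept-only logistic regression is exactly the sample proportion of the positive class, which is why every test row receives the same prediction. A small self-contained check on synthetic data:

```r
# toy check: an intercept-only logit recovers the observed class proportion
set.seed(1)
y <- rbinom(100, 1, 0.4)                      # synthetic binary outcome
m <- glm(y ~ 1, family = binomial(link = "logit"))
p <- unname(plogis(coef(m)))                  # inverse-logit of the intercept
stopifnot(abs(p - mean(y)) < 1e-6)            # equals the sample proportion
```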

Now, let’s implement models that can actually learn patterns from the data.

Bayes Classifiers

I will start my classification prediction using three Bayesian classifiers: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Regularized Discriminant Analysis (RDA). These models classify observations by estimating class-conditional probability distributions and drawing decision boundaries between the groups. To improve performance, I will use cross-validation, and I will tune RDA with a small grid search to keep computation manageable.
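Under the hood, caret's `"lda"` method wraps `MASS::lda()`. A minimal two-class sketch on the built-in iris data (not the Airbnb data) shows the core idea of a linear boundary between groups:

```r
library(MASS)  # lda() lives in the recommended MASS package

# drop one species so the problem is binary, like superhost yes/no
two  <- droplevels(subset(iris, Species != "setosa"))
fit  <- lda(Species ~ Sepal.Length + Petal.Length, data = two)
pred <- predict(fit, two)$class
mean(pred == two$Species)  # in-sample accuracy of the linear boundary
```

QDA relaxes LDA's shared-covariance assumption (one covariance matrix per class), and RDA interpolates between the two via its gamma and lambda parameters.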

LDA

# LDA with limited variables to address collinearity
ldaFit <- train(host_is_superhost ~ host_response_rate + 
                     host_acceptance_rate + 
                     review_scores_rating + 
                     number_of_reviews_log + 
                     host_listings_count_log + 
                     availability_365 + 
                     instant_bookable +
                     room_type +
                     minimum_nights_log +
                     maximum_nights_log,
               data = training,
               method = "lda", 
               preProcess = c("center", "scale"),
               metric = "ROC",
               trControl = ctrl)
print(ldaFit)
## Linear Discriminant Analysis 
## 
## 5505 samples
##   10 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (12), scaled (12) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4405, 4403, 4404, 4404, 4404 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8164645  0.7880954  0.6542048
# predict and evaluate
ldaPred = predict(ldaFit, testing)
ldaProb = predict(ldaFit, testing, type="prob")[,"Yes"]

# confusion matrix
confusionMatrix(ldaPred, testing$host_is_superhost)$table
##           Reference
## Prediction   No  Yes
##        No  1100  337
##        Yes  275  646
confusionMatrix(ldaPred, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.7404580 0.4613195

QDA

# QDA with subset of variables to avoid rank deficiency
qdaFit <- train(host_is_superhost ~ host_acceptance_rate + 
                     review_scores_rating + 
                     number_of_reviews_log + 
                     host_listings_count_log,
                data = training,
                method = "qda", 
                preProcess = c("center", "scale"),
                metric = "ROC",
                trControl = ctrl)
print(qdaFit)
## Quadratic Discriminant Analysis 
## 
## 5505 samples
##    4 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4404, 4404, 4404, 4403, 4405 
## Resampling results:
## 
##   ROC       Sens       Spec     
##   0.816121  0.5936455  0.8371014
# predict and evaluate
qdaPred = predict(qdaFit, testing)
qdaProb = predict(qdaFit, testing, type="prob")[,"Yes"]

# confusion matrix
confusionMatrix(qdaPred, testing$host_is_superhost)$table
##           Reference
## Prediction  No Yes
##        No  832 169
##        Yes 543 814
confusionMatrix(qdaPred, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.6980492 0.4108843

RDA

# set parameters
param_grid = expand.grid(
  gamma = seq(0, 1, by = 0.2),   
  lambda = seq(0.1, 0.9, by = 0.2)  
)

# regularized discriminant analysis (RDA)
rdaFit <- train(host_is_superhost ~ ., 
                method = "rda", 
                data = training,
                tuneGrid = param_grid,
                preProcess = c("center", "scale", "nzv"),
                metric = "ROC", 
                trControl = ctrl)
print(rdaFit)
## Regularized Discriminant Analysis 
## 
## 5505 samples
##   34 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (36), scaled (36), remove (37) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4404, 4405, 4404, 4403, 4404 
## Resampling results across tuning parameters:
## 
##   gamma  lambda  ROC        Sens       Spec     
##   0.0    0.1     0.8327533  0.5132581  0.9242152
##   0.0    0.3     0.8330306  0.5154368  0.9215999
##   0.0    0.5     0.8345526  0.5409927  0.9146320
##   0.0    0.7     0.8362851  0.5986431  0.8963370
##   0.0    0.9     0.8339591  0.7058325  0.8144530
##   0.2    0.1     0.8301806  0.4546722  0.9464308
##   0.2    0.3     0.8299529  0.4577889  0.9433816
##   0.2    0.5     0.8302763  0.4839649  0.9359790
##   0.2    0.7     0.8301227  0.5556369  0.9111499
##   0.2    0.9     0.8255152  0.7005346  0.8096666
##   0.4    0.1     0.8244632  0.4415808  0.9459932
##   0.4    0.3     0.8236637  0.4503079  0.9438164
##   0.4    0.5     0.8231746  0.4805357  0.9346708
##   0.4    0.7     0.8219596  0.5603117  0.8993890
##   0.4    0.9     0.8170788  0.7058340  0.7852733
##   0.6    0.1     0.8172954  0.4509285  0.9412020
##   0.6    0.3     0.8161275  0.4633944  0.9346680
##   0.6    0.5     0.8148528  0.4976764  0.9176812
##   0.6    0.7     0.8126548  0.5883559  0.8736848
##   0.6    0.9     0.8075053  0.7077007  0.7504272
##   0.8    0.1     0.8052204  0.4855206  0.9150649
##   0.8    0.3     0.8042458  0.5101404  0.9033049
##   0.8    0.5     0.8028165  0.5553224  0.8793474
##   0.8    0.7     0.7999560  0.6297996  0.8209804
##   0.8    0.9     0.7950304  0.7064555  0.7195036
##   1.0    0.1     0.7738672  0.6192043  0.7891797
##   1.0    0.3     0.7777978  0.6354105  0.7765502
##   1.0    0.5     0.7798037  0.6500542  0.7582609
##   1.0    0.7     0.7797607  0.6665695  0.7282107
##   1.0    0.9     0.7772867  0.6899412  0.6938079
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were gamma = 0 and lambda = 0.7.
plot(rdaFit)

# best tuning parameters
print("Best Parameters:")
## [1] "Best Parameters:"
print(rdaFit$bestTune)
##   gamma lambda
## 4     0    0.7
# predict and validate
rdaPred = predict(rdaFit, testing)
confusionMatrix(rdaPred, testing$host_is_superhost)$table
##           Reference
## Prediction  No Yes
##        No  821 133
##        Yes 554 850
confusionMatrix(rdaPred, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.7086514 0.4352208

Since this is an Airbnb classification task, not something high-risk like a medical diagnosis, misclassifications don’t have serious consequences: predicting superhost status incorrectly might affect recommendations or visibility on the platform, but it doesn’t lead to critical errors. Given that, I consider LDA the best of the three discriminant models, since it delivers the highest test accuracy (74.0%) while maintaining a reasonable balance between correctly identifying superhosts and non-superhosts.

Machine Learning Models

In this section, I apply machine learning models to predict Airbnb superhost status. I test k-Nearest Neighbors (k-NN), Random Forest, eXtreme Gradient Boosting (XGBoost), and Neural Networks, each with different strengths.

  • k-NN classifies hosts based on their closest neighbors but can struggle with high-dimensional data.
  • Random Forest uses multiple decision trees, making it robust and interpretable.
  • XGBoost improves on decision trees with boosting, but can be computationally expensive.
  • Neural Networks capture complex patterns but often require large datasets to perform well.

I will tune each model using a reduced grid search to minimize computation and evaluate them based on accuracy, while also considering sensitivity and specificity.

Nearest Neighbors

# reduced tuning grid to minimize computation
knn_grid <- expand.grid(
  kmax = c(3, 7, 11),  # odd numbers to break ties
  distance = 2,  # Euclidean distance
  kernel = "rectangular"  
)

# train KNN model 
knnFit <- train(
  host_is_superhost ~ ., 
  data = training,
  method = "kknn", 
  preProcess = c("center", "scale", "nzv"),
  tuneGrid = knn_grid,
  metric = "ROC", 
  trControl = ctrl  
)
print(knnFit)
## k-Nearest Neighbors 
## 
## 5505 samples
##   34 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (36), scaled (36), remove (37) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4404, 4404, 4404, 4404, 4404 
## Resampling results across tuning parameters:
## 
##   kmax  ROC        Sens       Spec     
##    3    0.8373676  0.7494598  0.8153311
##    7    0.8413914  0.7469676  0.8144596
##   11    0.8440476  0.7447869  0.8148953
## 
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
##  parameter 'kernel' was held constant at a value of rectangular
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 11, distance = 2 and kernel
##  = rectangular.
plot(knnFit)

# print best tuning parameters
best_knn_params <- knnFit$bestTune
print("Best Parameters:")
## [1] "Best Parameters:"
print(best_knn_params)
##   kmax distance      kernel
## 3   11        2 rectangular
# make predictions
knnPred <- predict(knnFit, testing)

# evaluate model performance
conf_matrix <- confusionMatrix(knnPred, testing$host_is_superhost)
print(conf_matrix$table)  
##           Reference
## Prediction   No  Yes
##        No  1049  212
##        Yes  326  771
print(conf_matrix$overall[1:2])  
##  Accuracy     Kappa 
## 0.7718405 0.5383433

Random Forest

# define tuning grid for Random Forest
rf_grid <- expand.grid(
  mtry = c(3, 6, 9, 12),  
  splitrule = "gini",  
  min.node.size = c(1, 5, 10)  
)

# train Random Forest Model with Tuning
rfFit <- train(
  host_is_superhost ~ ., 
  data = training,
  method = "ranger",   
  preProcess = c("center", "scale", "nzv"),
  tuneGrid = rf_grid,
  metric = "ROC",  
  trControl = ctrl
)
print(rfFit)
## Random Forest 
## 
## 5505 samples
##   34 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (36), scaled (36), remove (37) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4404, 4403, 4405, 4404, 4404 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  ROC        Sens       Spec     
##    3     1             0.9237782  0.8831401  0.7979142
##    3     5             0.9213197  0.8837632  0.7900786
##    3    10             0.9198286  0.8815806  0.7857204
##    6     1             0.9226105  0.8722333  0.8014010
##    6     5             0.9217429  0.8747255  0.8040078
##    6    10             0.9191294  0.8722333  0.8018310
##    9     1             0.9213899  0.8697392  0.8087998
##    9     5             0.9205681  0.8684921  0.8079255
##    9    10             0.9179619  0.8713002  0.8040116
##   12     1             0.9208145  0.8681815  0.8087980
##   12     5             0.9196938  0.8666239  0.8087998
##   12    10             0.9172244  0.8678705  0.8022648
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 3, splitrule = gini
##  and min.node.size = 1.
plot(rfFit)

# best tuning parameters
best_rf_params <- rfFit$bestTune
print("Best Parameters:")
## [1] "Best Parameters:"
print(best_rf_params)
##   mtry splitrule min.node.size
## 1    3      gini             1
# make predictions
rfPred <- predict(rfFit, testing)

# evaluate model performance
conf_matrix <- confusionMatrix(rfPred, testing$host_is_superhost)
print(conf_matrix$table)  
##           Reference
## Prediction   No  Yes
##        No  1231  217
##        Yes  144  766
print(conf_matrix$overall[1:2])  
##  Accuracy     Kappa 
## 0.8469042 0.6817366

Gradient Boosting

# define tuning grid for Gradient Boosting
gbm_grid <- expand.grid(
  nrounds = c(50, 100),  # Reduced boosting iterations
  max_depth = c(3, 6),  # Smaller range of tree depths
  eta = c(0.1, 0.3),  # Only 2 learning rate values
  gamma = 0,  # Fixed (removes unnecessary tuning)
  colsample_bytree = 0.8,  # Fixed (removes unnecessary tuning)
  min_child_weight = 1,  # Default (prevents overfitting)
  subsample = 0.8  # Default 
)

# train XGBoost model with warning suppression
invisible(
  capture.output(
    gbmFit <- train(
      host_is_superhost ~ ., 
      data = training,
      method = "xgbTree",   
      preProcess = c("center", "scale", "nzv"),
      tuneGrid = gbm_grid,
      metric = "ROC",  
      trControl = ctrl,
      verbose = FALSE
    )
  )
)
print(gbmFit)
## eXtreme Gradient Boosting 
## 
## 5505 samples
##   34 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (36), scaled (36), remove (37) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4403, 4404, 4405, 4404, 4404 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  nrounds  ROC        Sens       Spec     
##   0.1  3           50      0.8990544  0.8385831  0.7887601
##   0.1  3          100      0.9146694  0.8541633  0.8131448
##   0.1  6           50      0.9239731  0.8579041  0.8310022
##   0.1  6          100      0.9322681  0.8703695  0.8366761
##   0.3  3           50      0.9174082  0.8550984  0.8074889
##   0.3  3          100      0.9234058  0.8635159  0.8201184
##   0.3  6           50      0.9266903  0.8632044  0.8244814
##   0.3  6          100      0.9290018  0.8666307  0.8240504
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 6, eta
##  = 0.1, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
##  = 0.8.
plot(gbmFit)

# best tuning parameters
best_gbm_params <- gbmFit$bestTune
print("Best Parameters:")
## [1] "Best Parameters:"
print(best_gbm_params)
##   nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 4     100         6 0.1     0              0.8                1       0.8
# make predictions
gbmPred <- predict(gbmFit, testing)

# evaluate model performance
conf_matrix <- confusionMatrix(gbmPred, testing$host_is_superhost)
print(conf_matrix$table)  # Confusion matrix
##           Reference
## Prediction   No  Yes
##        No  1222  183
##        Yes  153  800
print(conf_matrix$overall[1:2])  # Accuracy & Kappa
##  Accuracy     Kappa 
## 0.8575064 0.7056322

Neural Networks

# define grid for Neural Networks
nn_grid <- expand.grid(
  size = c(2, 4, 6),  
  decay = c(0.001, 0.01, 0.1)  
)

# train Neural Network 
invisible(
  capture.output(
    nnFit <- train(
      host_is_superhost ~ ., 
      data = training,
      method = "nnet",  
      preProcess = c("center", "scale", "nzv"),
      trControl = ctrl,
      tuneGrid = nn_grid,
      metric = "ROC",
      trace = FALSE,  
      maxit = 100  # reduce iterations for faster training
    )
  )
)
print(nnFit)
## Neural Network 
## 
## 5505 samples
##   34 predictor
##    2 classes: 'No', 'Yes' 
## 
## Pre-processing: centered (36), scaled (36), remove (37) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 4405, 4403, 4404, 4404, 4404 
## Resampling results across tuning parameters:
## 
##   size  decay  ROC        Sens       Spec     
##   2     0.001  0.8509315  0.8027406  0.7430283
##   2     0.010  0.8485718  0.7983729  0.7447731
##   2     0.100  0.8563372  0.8124008  0.7386672
##   4     0.001  0.8442972  0.7999412  0.7469309
##   4     0.010  0.8415781  0.7909021  0.7299422
##   4     0.100  0.8585104  0.8077245  0.7460822
##   6     0.001  0.8432398  0.7921550  0.7225519
##   6     0.010  0.8480903  0.7912199  0.7447504
##   6     0.100  0.8660614  0.8217495  0.7495377
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were size = 6 and decay = 0.1.
plot(nnFit)

# best tuning parameters
best_nn_params <- nnFit$bestTune
print("Best Parameters:")
## [1] "Best Parameters:"
print(best_nn_params)
##   size decay
## 9    6   0.1
# make predictions
nnPred <- predict(nnFit, testing)

# evaluate model performance
conf_matrix <- confusionMatrix(nnPred, testing$host_is_superhost)
print(conf_matrix$table)  
##           Reference
## Prediction   No  Yes
##        No  1137  271
##        Yes  238  712
print(conf_matrix$overall[1:2])  
##  Accuracy     Kappa 
## 0.7841391 0.5538729

Gradient boosting provided the best accuracy and predictive power of the ML models, making it the most effective model for this task. Random Forest was a close second and trained more quickly. The Neural Network performed decently, but not as well as the tree ensembles. This is consistent with the common finding that neural networks are often outperformed by tree-based models on structured, tabular datasets like this one.

Comparing Models

Now let’s compare all seven models (three Bayes classifiers and four ML models) against one another.

# model Comparison
# get probabilities for each model
ldaProb = predict(ldaFit, testing, type="prob")[,"Yes"]
qdaProb = predict(qdaFit, testing, type="prob")[,"Yes"]
rdaProb = predict(rdaFit, testing, type="prob")[,"Yes"]
knnProb = predict(knnFit, testing, type="prob")[,"Yes"]
rfProb = predict(rfFit, testing, type="prob")[,"Yes"]
gbmProb = predict(gbmFit, testing, type="prob")[,"Yes"]
nnProb = predict(nnFit, testing, type="prob")[,"Yes"]

# Calculate ROC curves
library(pROC)
roc_lda = roc(testing$host_is_superhost == "Yes" ~ ldaProb)
roc_qda = roc(testing$host_is_superhost == "Yes" ~ qdaProb)
roc_rda = roc(testing$host_is_superhost == "Yes" ~ rdaProb)
roc_knn = roc(testing$host_is_superhost == "Yes" ~ knnProb)
roc_rf = roc(testing$host_is_superhost == "Yes" ~ rfProb)
roc_gbm = roc(testing$host_is_superhost == "Yes" ~ gbmProb)
roc_nn = roc(testing$host_is_superhost == "Yes" ~ nnProb)

# Plot ROC curves
plot(roc_lda, col="darkgreen", main="ROC Curves for Different Models")
plot(roc_qda, col="orange", add=TRUE)
plot(roc_rda, col="blue", add=TRUE)
plot(roc_knn, col="red", add=TRUE)
plot(roc_rf, col="green", add=TRUE)
plot(roc_gbm, col="purple", add=TRUE)
plot(roc_nn, col="brown", add=TRUE)
legend("bottomright", 
       legend=c("LDA", "QDA", "RDA", "KNN", "RF", "GBM", "NN"), 
       col=c("darkgreen", "orange", "blue", "red", "green", "purple", "brown"), 
       lwd=2)

# compare AUC values
auc_values <- data.frame(
  Model = c("LDA", "QDA", "RDA", "KNN", "RF", "GBM", "NN"),
  AUC = c(roc_lda$auc, roc_qda$auc, roc_rda$auc, roc_knn$auc, roc_rf$auc, roc_gbm$auc, roc_nn$auc)
)
auc_values <- auc_values[order(-auc_values$AUC),]
print(auc_values)
##   Model       AUC
## 6   GBM 0.9387915
## 5    RF 0.9280307
## 7    NN 0.8547526
## 4   KNN 0.8396156
## 3   RDA 0.8292760
## 1   LDA 0.8220364
## 2   QDA 0.8156212
# variable importance for best model (Gradient Boosting)
gbm_imp <- varImp(gbmFit, scale = FALSE)
plot(gbm_imp, top = 15, main = "Top 15 Most Important Variables (GBM)")

# get top variables from importance plot
important_vars <- rownames(gbm_imp$importance)[order(-gbm_imp$importance[,1])][1:3]

# create partial dependence plots for all top variables
for(var in important_vars) {
  pdp_plot <- partial(gbmFit, pred.var = var, which.class = 2, plot = TRUE, 
                      prob = TRUE, rug = TRUE)
  print(pdp_plot)
}  

The ROC curve and AUC scores confirm that Gradient Boosting performed the best, achieving the highest AUC, followed by Random Forest. These models were the most effective at distinguishing superhosts from non-superhosts. Overall, the ML models performed better than the Bayesian Classifiers.
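AUC has a concrete interpretation: it is the probability that a randomly chosen superhost receives a higher predicted score than a randomly chosen non-superhost. A package-free sketch of this rank-based (Mann-Whitney) formulation, using hypothetical scores:

```r
# AUC from ranks: no ROC curve construction needed
auc_rank <- function(prob, is_pos) {
  r     <- rank(prob)                       # ties get averaged ranks
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(c(0.9, 0.8, 0.3, 0.2), c(TRUE, TRUE, FALSE, FALSE))  # perfectly ranked -> 1
```

This matches what `pROC::roc()` computes, and makes clear why AUC is threshold-free: it depends only on the ordering of the scores.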

The feature importance plot for Gradient Boosting, the top performing model, shows that number of reviews, host listing count, and acceptance rate are the most influential factors in predicting superhost status. The partial dependence plots suggest that hosts with more reviews and higher acceptance rates are more likely to be superhosts, while hosts with an extremely high number of listings may be less likely. These findings align with those we saw earlier in the simple linear model and decision tree.
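The partial dependence curves above follow a simple recipe: fix the feature of interest at each grid value, leave every other column as observed, and average the model's predictions. A package-free sketch of that recipe on mtcars with a plain linear model (illustrative only; `pdp::partial()` does the same for the GBM):

```r
# manual partial dependence of mpg on wt
fit  <- lm(mpg ~ wt + hp, data = mtcars)
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)

pdp <- sapply(grid, function(v) {
  d    <- mtcars
  d$wt <- v                  # force wt to the grid value for every row
  mean(predict(fit, d))      # average prediction = partial dependence at v
})
pdp  # for a linear model this traces a straight, downward-sloping line
```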

Threshold Optimization for Best Model

Next I will adjust the classification threshold for the best performing model, Gradient Boosting, to see how different probability cutoffs affect accuracy and the balance between false positives and false negatives. This helps determine the optimal threshold that best distinguishes superhosts from non-superhosts while minimizing misclassification errors.

#threshold Optimization
threshold = 0.3  
gbmProb = predict(gbmFit, testing, type="prob")
gbm_adjusted_pred <- as.factor(ifelse(gbmProb[,"Yes"] > threshold, "Yes", "No"))

# check confusion matrix with adjusted threshold
print(paste("GBM with threshold =", threshold))
## [1] "GBM with threshold = 0.3"
confusionMatrix(gbm_adjusted_pred, testing$host_is_superhost)$table
##           Reference
## Prediction   No  Yes
##        No  1120   86
##        Yes  255  897
confusionMatrix(gbm_adjusted_pred, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.8553859 0.7096665
#threshold Optimization
threshold = 0.4  
gbmProb = predict(gbmFit, testing, type="prob")
gbm_adjusted_pred <- as.factor(ifelse(gbmProb[,"Yes"] > threshold, "Yes", "No"))

# check confusion matrix with adjusted threshold
print(paste("GBM with threshold =", threshold))
## [1] "GBM with threshold = 0.4"
confusionMatrix(gbm_adjusted_pred, testing$host_is_superhost)$table
##           Reference
## Prediction   No  Yes
##        No  1176  133
##        Yes  199  850
confusionMatrix(gbm_adjusted_pred, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.8592027 0.7131473
#threshold Optimization
threshold = 0.5  
gbmProb = predict(gbmFit, testing, type="prob")
gbm_adjusted_pred <- as.factor(ifelse(gbmProb[,"Yes"] > threshold, "Yes", "No"))

# check confusion matrix with adjusted threshold
print(paste("GBM with threshold =", threshold))
## [1] "GBM with threshold = 0.5"
confusionMatrix(gbm_adjusted_pred, testing$host_is_superhost)$table
##           Reference
## Prediction   No  Yes
##        No  1222  183
##        Yes  153  800
confusionMatrix(gbm_adjusted_pred, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.8575064 0.7056322

Adjusting the classification threshold for the Gradient Boosting model changes the balance between false positives and false negatives. A lower threshold (0.3) classifies more hosts as superhosts, capturing more true superhosts but admitting more false positives. Raising the threshold to 0.4 trims those false positives and achieves the highest accuracy (85.9%). The default threshold of 0.5 is the most conservative, minimizing false positives at the cost of more missed superhosts. These results suggest that 0.4 is the optimal threshold for this dataset, striking the best balance between capturing true superhosts and avoiding misclassification.
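Rather than checking cutoffs one at a time, the whole accuracy-versus-threshold curve can be scanned in one pass. A sketch on synthetic scores (the real version would substitute `gbmProb[,"Yes"]` and `testing$host_is_superhost`):

```r
set.seed(42)
truth <- factor(rbinom(500, 1, 0.42), levels = 0:1, labels = c("No", "Yes"))
score <- ifelse(truth == "Yes", rbeta(500, 4, 2), rbeta(500, 2, 4))  # imperfect scores

thresholds <- seq(0.1, 0.9, by = 0.05)
accuracy <- sapply(thresholds, function(th) {
  pred <- factor(ifelse(score > th, "Yes", "No"), levels = c("No", "Yes"))
  mean(pred == truth)
})
thresholds[which.max(accuracy)]  # cutoff with the best raw accuracy
```

If false negatives and false positives had different costs, the same loop could maximize a cost-weighted metric instead of accuracy.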

Ensemble Model

Finally, I will use an ensemble model, which combines multiple classifiers to improve overall prediction performance. The goal is to leverage the strengths of different models to achieve better accuracy and robustness.

# ensemble model
# first get the probabilities for the "Yes" class from top models
rf_probs <- predict(rfFit, testing, type="prob")[,"Yes"]
gbm_probs <- predict(gbmFit, testing, type="prob")[,"Yes"]
knn_probs <- predict(knnFit, testing, type="prob")[,"Yes"]
nn_probs <- predict(nnFit, testing, type="prob")[,"Yes"]

# create ensemble by simple averaging
ensemble_probs <- (rf_probs + gbm_probs + knn_probs + nn_probs)/4

# try a specific threshold 
threshold <- 0.3
ensemble_preds <- as.factor(ifelse(ensemble_probs > threshold, "Yes", "No"))
print("Ensemble model:")
## [1] "Ensemble model:"
confusionMatrix(ensemble_preds, testing$host_is_superhost)$table
##           Reference
## Prediction  No Yes
##        No  994  71
##        Yes 381 912
confusionMatrix(ensemble_preds, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.8083121 0.6226893
# try another threshold 
threshold <- 0.4
ensemble_preds <- as.factor(ifelse(ensemble_probs > threshold, "Yes", "No"))
print("Ensemble model:")
## [1] "Ensemble model:"
confusionMatrix(ensemble_preds, testing$host_is_superhost)$table
##           Reference
## Prediction   No  Yes
##        No  1122  125
##        Yes  253  858
confusionMatrix(ensemble_preds, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.8396947 0.6762855
# try a third threshold
threshold <- 0.5
ensemble_preds <- as.factor(ifelse(ensemble_probs > threshold, "Yes", "No"))
print(paste("Ensemble with threshold =", threshold))
## [1] "Ensemble with threshold = 0.5"
confusionMatrix(ensemble_preds, testing$host_is_superhost)$table
##           Reference
## Prediction   No  Yes
##        No  1195  194
##        Yes  180  789
confusionMatrix(ensemble_preds, testing$host_is_superhost)$overall[1:2]
##  Accuracy     Kappa 
## 0.8413910 0.6731024

I created the ensemble by averaging the predicted probabilities of the four best-performing models: XGBoost, Random Forest, k-NN, and the Neural Network. After combining their predictions, I again varied the threshold that sets the probability cutoff for classifying a host as a superhost. The ensemble’s best accuracy (84.1% at the 0.5 threshold) comes in slightly below that of the top individual model, Gradient Boosting, but the ensemble may be more robust in practice, since it balances several models’ strengths rather than relying on just one.
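The simple average above weights each model equally; a small generalization allows unequal weights (for example, proportional to each model's AUC). The helper and the toy probabilities below are hypothetical, just to show the mechanics:

```r
# average the "Yes" probabilities from several models, optionally weighted
avg_ensemble <- function(prob_list, weights = NULL) {
  m <- do.call(cbind, prob_list)              # one column per model
  if (is.null(weights)) weights <- rep(1, ncol(m))
  as.vector(m %*% (weights / sum(weights)))   # weighted average per row
}

# equal weights reproduce the simple average used above
avg_ensemble(list(rf = c(0.9, 0.2), gbm = c(0.7, 0.4)))           # 0.80, 0.30
# upweighting the stronger model shifts the blend toward it
avg_ensemble(list(rf = c(0.9, 0.2), gbm = c(0.7, 0.4)), c(1, 3))  # 0.75, 0.35
```

A further step would be stacking: fitting a meta-model on the base models' out-of-fold predictions instead of fixing the weights by hand.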

Regression (Interpretation)

Data Splitting

I will use the same 70/30 split for regression as I did for classification, but change the target variable.

# split the data
in_train <- createDataPartition(airbnb_final$price, p = 0.7, list = FALSE)
training <- airbnb_final[in_train,]
testing <- airbnb_final[-in_train,]
nrow(training)
## [1] 5506
nrow(testing)
## [1] 2357

Once again, I’ll use 5-fold cross-validation, which keeps the computational cost manageable while still giving reliable performance estimates.
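For intuition, here is what 5-fold cross-validation does under the hood, sketched package-free on mtcars: assign each row a fold, then refit the model five times, each time scoring the one held-out fold.

```r
set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(mtcars)))  # random fold assignment

fold_rmse <- sapply(1:5, function(k) {
  fit  <- lm(mpg ~ wt, data = mtcars[folds != k, ])   # train on four folds
  held <- mtcars[folds == k, ]                        # score the held-out fold
  sqrt(mean((held$mpg - predict(fit, held))^2))
})
mean(fold_rmse)  # the cross-validated RMSE estimate
```

caret's `trainControl(method = "repeatedcv", number = 5, ...)` automates exactly this loop, repeated for every candidate tuning parameter.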

# set up 5-fold cross-validation for regression
ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 1,
                     verboseIter = FALSE,
                     returnResamp = "all",
                     summaryFunction = defaultSummary) 

Linear regression

Similarly to classification, I will use a basic linear model to understand the most important predictors of my regression target, price. I will start by using all the variables.
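One note before fitting: with log(price) as the outcome, a coefficient b reads as roughly a 100 * (exp(b) - 1) percent change in price. The value below is illustrative, not taken from the fitted model:

```r
b <- 0.019                 # hypothetical coefficient on a dummy variable
(exp(b) - 1) * 100         # implied percent price difference, about 1.9%
```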

linFit <- lm(log(price) ~ . - latitude - longitude - price, data = training) # drop raw coordinates (neighbourhood already captures location) and the target itself
summary(linFit)
## 
## Call:
## lm(formula = log(price) ~ . - latitude - longitude - price, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8555 -0.1006  0.0285  0.1136  4.1400 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                 -4.740e+01  3.501e+01  -1.354
## host_is_superhostYes                         1.895e-02  7.918e-03   2.393
## neighbourhood_cleansedBernal Heights         2.305e-02  3.304e-02   0.698
## neighbourhood_cleansedCastro/Upper Market    7.979e-02  4.499e-02   1.774
## neighbourhood_cleansedChinatown             -6.975e-02  5.792e-02  -1.204
## neighbourhood_cleansedCrocker Amazon        -1.393e-01  5.326e-02  -2.615
## neighbourhood_cleansedDiamond Heights       -4.444e-02  8.073e-02  -0.551
## neighbourhood_cleansedDowntown/Civic Center  5.238e-02  5.118e-02   1.023
## neighbourhood_cleansedExcelsior             -5.759e-02  3.546e-02  -1.624
## neighbourhood_cleansedFinancial District     9.593e-03  5.409e-02   0.177
## neighbourhood_cleansedGlen Park              2.878e-02  4.916e-02   0.585
## neighbourhood_cleansedGolden Gate Park      -5.470e-02  8.521e-02  -0.642
## neighbourhood_cleansedHaight Ashbury         2.687e-02  5.145e-02   0.522
## neighbourhood_cleansedInner Richmond        -2.882e-02  6.401e-02  -0.450
## neighbourhood_cleansedInner Sunset          -2.353e-02  5.608e-02  -0.420
## neighbourhood_cleansedLakeshore             -1.181e-01  6.233e-02  -1.894
## neighbourhood_cleansedMarina                 4.256e-02  6.685e-02   0.637
## neighbourhood_cleansedMission                9.943e-03  3.717e-02   0.267
## neighbourhood_cleansedNob Hill               5.387e-02  5.657e-02   0.952
## neighbourhood_cleansedNoe Valley             3.311e-02  4.031e-02   0.821
## neighbourhood_cleansedNorth Beach           -1.755e-02  6.499e-02  -0.270
## neighbourhood_cleansedOcean View            -6.947e-02  4.897e-02  -1.419
## neighbourhood_cleansedOuter Mission         -1.779e-02  4.206e-02  -0.423
## neighbourhood_cleansedOuter Richmond        -7.568e-02  7.287e-02  -1.039
## neighbourhood_cleansedOuter Sunset          -8.322e-02  6.212e-02  -1.340
## neighbourhood_cleansedPacific Heights        7.942e-02  6.223e-02   1.276
## neighbourhood_cleansedParkside              -5.170e-02  5.888e-02  -0.878
## neighbourhood_cleansedPotrero Hill           7.302e-02  4.006e-02   1.823
## neighbourhood_cleansedPresidio              -2.276e-02  9.884e-02  -0.230
## neighbourhood_cleansedPresidio Heights      -5.326e-02  7.884e-02  -0.676
## neighbourhood_cleansedRussian Hill           2.741e-02  6.494e-02   0.422
## neighbourhood_cleansedSeacliff              -1.785e-02  1.018e-01  -0.175
## neighbourhood_cleansedSouth of Market        7.259e-02  4.636e-02   1.566
## neighbourhood_cleansedTwin Peaks            -3.545e-02  5.924e-02  -0.598
## neighbourhood_cleansedVisitacion Valley     -2.382e-02  4.243e-02  -0.561
## neighbourhood_cleansedWest of Twin Peaks    -3.408e-02  4.934e-02  -0.691
## neighbourhood_cleansedWestern Addition       4.138e-03  5.115e-02   0.081
## room_typeHotel room                         -3.744e-02  2.955e-02  -1.267
## room_typePrivate room                       -1.199e-01  8.527e-03 -14.060
## room_typeShared room                        -1.102e-01  3.886e-02  -2.835
## accommodates                                 1.688e-02  3.330e-03   5.068
## bathrooms                                   -2.555e-02  5.258e-03  -4.860
## bedrooms                                     2.265e-03  6.010e-03   0.377
## beds                                        -1.139e-02  4.781e-03  -2.382
## host_response_timewithin a few hours        -1.741e-02  1.005e-02  -1.733
## host_response_timewithin a day              -1.054e-03  1.410e-02  -0.075
## host_response_timea few days or more         8.246e-02  5.099e-02   1.617
## host_response_rate                           6.913e-02  5.018e-02   1.378
## host_acceptance_rate                        -5.865e-02  1.772e-02  -3.309
## host_identity_verifiedYes                   -7.476e-03  1.210e-02  -0.618
## host_has_profile_picYes                      1.148e-02  3.194e-02   0.360
## host_listings_count                          2.182e-05  6.354e-06   3.434
## review_scores_rating                        -2.996e-02  1.845e-02  -1.623
## review_scores_cleanliness                    6.687e-02  1.283e-02   5.214
## review_scores_checkin                       -3.074e-02  1.472e-02  -2.088
## review_scores_communication                  4.950e-03  1.487e-02   0.333
## review_scores_location                       3.445e-02  1.106e-02   3.113
## review_scores_value                         -2.840e-02  1.433e-02  -1.982
## number_of_reviews                            1.935e-06  4.359e-05   0.044
## minimum_nights                               1.060e-04  1.141e-04   0.929
## maximum_nights                              -1.452e-04  7.042e-05  -2.063
## availability_365                            -3.444e-05  2.631e-05  -1.309
## instant_bookableYes                          4.809e-02  8.477e-03   5.673
## host_tenure_days                             6.114e-06  2.737e-06   2.234
## host_listings_count_log                     -1.334e-02  3.144e-03  -4.243
## number_of_reviews_log                       -1.050e-03  3.062e-03  -0.343
## minimum_nights_log                          -2.159e-02  4.577e-03  -4.716
## maximum_nights_log                           1.567e-02  8.799e-03   1.781
## coords_squared                               2.049e-03  2.310e-03   0.887
## coords_mult                                 -3.846e-03  6.675e-03  -0.576
## price_capped                                 5.135e-03  3.558e-05 144.303
##                                             Pr(>|t|)    
## (Intercept)                                 0.175823    
## host_is_superhostYes                        0.016731 *  
## neighbourhood_cleansedBernal Heights        0.485382    
## neighbourhood_cleansedCastro/Upper Market   0.076190 .  
## neighbourhood_cleansedChinatown             0.228549    
## neighbourhood_cleansedCrocker Amazon        0.008939 ** 
## neighbourhood_cleansedDiamond Heights       0.581985    
## neighbourhood_cleansedDowntown/Civic Center 0.306141    
## neighbourhood_cleansedExcelsior             0.104388    
## neighbourhood_cleansedFinancial District    0.859242    
## neighbourhood_cleansedGlen Park             0.558292    
## neighbourhood_cleansedGolden Gate Park      0.520952    
## neighbourhood_cleansedHaight Ashbury        0.601551    
## neighbourhood_cleansedInner Richmond        0.652549    
## neighbourhood_cleansedInner Sunset          0.674841    
## neighbourhood_cleansedLakeshore             0.058237 .  
## neighbourhood_cleansedMarina                0.524435    
## neighbourhood_cleansedMission               0.789108    
## neighbourhood_cleansedNob Hill              0.340976    
## neighbourhood_cleansedNoe Valley            0.411507    
## neighbourhood_cleansedNorth Beach           0.787082    
## neighbourhood_cleansedOcean View            0.156083    
## neighbourhood_cleansedOuter Mission         0.672365    
## neighbourhood_cleansedOuter Richmond        0.299023    
## neighbourhood_cleansedOuter Sunset          0.180372    
## neighbourhood_cleansedPacific Heights       0.201937    
## neighbourhood_cleansedParkside              0.380006    
## neighbourhood_cleansedPotrero Hill          0.068385 .  
## neighbourhood_cleansedPresidio              0.817930    
## neighbourhood_cleansedPresidio Heights      0.499371    
## neighbourhood_cleansedRussian Hill          0.672977    
## neighbourhood_cleansedSeacliff              0.860812    
## neighbourhood_cleansedSouth of Market       0.117433    
## neighbourhood_cleansedTwin Peaks            0.549587    
## neighbourhood_cleansedVisitacion Valley     0.574582    
## neighbourhood_cleansedWest of Twin Peaks    0.489857    
## neighbourhood_cleansedWestern Addition      0.935524    
## room_typeHotel room                         0.205201    
## room_typePrivate room                        < 2e-16 ***
## room_typeShared room                        0.004598 ** 
## accommodates                                4.15e-07 ***
## bathrooms                                   1.21e-06 ***
## bedrooms                                    0.706250    
## beds                                        0.017252 *  
## host_response_timewithin a few hours        0.083169 .  
## host_response_timewithin a day              0.940434    
## host_response_timea few days or more        0.105869    
## host_response_rate                          0.168404    
## host_acceptance_rate                        0.000943 ***
## host_identity_verifiedYes                   0.536678    
## host_has_profile_picYes                     0.719226    
## host_listings_count                         0.000600 ***
## review_scores_rating                        0.104595    
## review_scores_cleanliness                   1.92e-07 ***
## review_scores_checkin                       0.036882 *  
## review_scores_communication                 0.739195    
## review_scores_location                      0.001859 ** 
## review_scores_value                         0.047508 *  
## number_of_reviews                           0.964592    
## minimum_nights                              0.352801    
## maximum_nights                              0.039191 *  
## availability_365                            0.190597    
## instant_bookableYes                         1.48e-08 ***
## host_tenure_days                            0.025548 *  
## host_listings_count_log                     2.24e-05 ***
## number_of_reviews_log                       0.731793    
## minimum_nights_log                          2.46e-06 ***
## maximum_nights_log                          0.074910 .  
## coords_squared                              0.375234    
## coords_mult                                 0.564483    
## price_capped                                 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2382 on 5435 degrees of freedom
## Multiple R-squared:  0.9027, Adjusted R-squared:  0.9015 
## F-statistic: 720.5 on 70 and 5435 DF,  p-value: < 2.2e-16

Then I will create a simplified model using the most relevant predictors.

# Create a simplified model with key significant predictors
simple_model <- lm(
  log(price) ~ room_type + bathrooms + accommodates + host_listings_count + host_acceptance_rate + review_scores_location + number_of_reviews + availability_365 + minimum_nights_log,
  data = training
)

summary(simple_model)
## 
## Call:
## lm(formula = log(price) ~ room_type + bathrooms + accommodates + 
##     host_listings_count + host_acceptance_rate + review_scores_location + 
##     number_of_reviews + availability_365 + minimum_nights_log, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0932 -0.3332 -0.0549  0.2816  5.9028 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.583e+00  1.065e-01  43.017  < 2e-16 ***
## room_typeHotel room    -3.761e-01  6.963e-02  -5.401 6.89e-08 ***
## room_typePrivate room  -4.076e-01  1.924e-02 -21.183  < 2e-16 ***
## room_typeShared room   -3.252e-01  9.111e-02  -3.569 0.000361 ***
## bathrooms               5.492e-02  1.138e-02   4.824 1.44e-06 ***
## accommodates            1.491e-01  4.840e-03  30.802  < 2e-16 ***
## host_listings_count     7.635e-05  9.735e-06   7.843 5.26e-15 ***
## host_acceptance_rate   -2.978e-01  3.436e-02  -8.667  < 2e-16 ***
## review_scores_location  1.388e-01  1.992e-02   6.970 3.54e-12 ***
## number_of_reviews      -6.643e-04  7.583e-05  -8.761  < 2e-16 ***
## availability_365       -2.992e-04  6.042e-05  -4.952 7.55e-07 ***
## minimum_nights_log     -1.174e-01  6.951e-03 -16.885  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5802 on 5494 degrees of freedom
## Multiple R-squared:  0.4167, Adjusted R-squared:  0.4155 
## F-statistic: 356.7 on 11 and 5494 DF,  p-value: < 2.2e-16

This model highlights key factors influencing Airbnb pricing. Entire homes/apartments are priced higher than other room types, and larger accommodations with higher guest capacity increase price. Hosts with multiple listings tend to charge more, while those with higher acceptance rates set lower prices, possibly reflecting different hosting strategies. High location ratings drive up price, while greater availability, more reviews, and longer minimum stays tend to lower it. This suggests that exclusivity, guest capacity, and perceived quality play a major role in pricing decisions.
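Because the outcome is log(price), a coefficient β translates to an approximate percentage effect of (exp(β) − 1) × 100. A quick check using the private-room coefficient from the model above:

```r
# convert a log-price coefficient into an approximate percentage price effect
pct_effect <- function(beta) (exp(beta) - 1) * 100

# private-room coefficient from the simplified model above
round(pct_effect(-0.4076), 1)  # -33.5: private rooms are priced ~33% below entire homes
```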

Decision Tree

I will also create a decision tree to visualize the most predictive variables.

# define control parameters for interpretability
control <- rpart.control(minsplit = 30, cp = 0.02)

# create a regression decision tree
dt_regression <- rpart(price ~ ., data = training, method = "anova", control = control)

# plot the regression tree
rpart.plot(dt_regression, digits = 3)

This decision tree shows that the number of bathrooms is a very influential factor in determining Airbnb price. Check-in review scores and location, represented by coords_mult, also play an important role.

Regression (Prediction)

Benchmark Model

Once again I will start with a simple benchmark model to establish a baseline for comparison.

# calculate the mean price
mean_price <- mean(training$price)

# create a benchmark model that predicts the mean price for all observations
benchFit <- lm(price ~ 1, data = training)

# predict on the testing set
predictions <- rep(mean_price, nrow(testing))

# calculate R-squared
r_squared <- 1 - (sum((testing$price - predictions)^2) / sum((testing$price - mean(testing$price))^2))

# calculate RMSE
RMSE <- sqrt(mean((predictions - testing$price)^2))

# print results
cat("Mean Price:", mean_price, "\n")
## Mean Price: 250.9025
cat("R-squared:", r_squared, "\n")
## R-squared: -0.002692992
cat("RMSE:", RMSE, "\n")
## RMSE: 556.825

The R-squared value is essentially zero, indicating that the model explains none of the variation in price. This is expected, since it includes no explanatory variables.
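The slightly negative test-set R² is a quirk worth noting: predicting the training mean scores just below zero whenever the training and test means differ. A small synthetic check (the numbers here are made up, not from the Airbnb data):

```r
# R-squared of a mean-only benchmark on held-out data
r2 <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

set.seed(1)
y_test <- rnorm(100, mean = 250, sd = 50)  # synthetic test-set prices
train_mean <- 255                          # hypothetical training-set mean

r2(y_test, rep(mean(y_test), 100))  # exactly 0 when predicting the test set's own mean
r2(y_test, rep(train_mean, 100))    # slightly below 0, as with the benchmark above
```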

Regularization Models

Now that I have established a baseline, I can start to make predictions. I will first use regularization models including Ridge, Lasso, and Elastic Net. These models help improve prediction accuracy by preventing overfitting, especially when dealing with many correlated variables.

  • Ridge Regression shrinks coefficients toward zero but keeps all predictors, helping when features are highly correlated.

  • Lasso Regression can shrink some coefficients to exactly zero, effectively performing feature selection by removing less important variables.

  • Elastic Net combines Ridge and Lasso, balancing between keeping useful variables and eliminating redundant ones.

By tuning each model’s penalty parameters, I aim to find the best balance between bias and variance, improving predictive performance while avoiding unnecessary complexity.
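The three penalties differ only in how they mix the L1 and L2 terms, which is the same mixture that the `alpha` parameter controls in the tuning below. A minimal base-R sketch of the penalty itself (the function name `enet_penalty` is mine, not from any package):

```r
# elastic-net penalty: lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|))
# alpha = 0 recovers ridge, alpha = 1 recovers lasso
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta)))
}

beta <- c(2, -0.5, 0)
enet_penalty(beta, lambda = 0.1, alpha = 0)    # ridge:       0.2125
enet_penalty(beta, lambda = 0.1, alpha = 1)    # lasso:       0.25
enet_penalty(beta, lambda = 0.1, alpha = 0.5)  # elastic net: 0.23125
```

The L1 term is what lets lasso drive coefficients exactly to zero, while the quadratic L2 term only shrinks them toward zero.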

Ridge Regression

# drop the neighbourhood factor (its sparse dummy-coded levels are near-zero-variance and destabilize the penalized fits)
training_cleaned <- training |> 
  dplyr::select(-neighbourhood_cleansed)

# Define tuning grid for Ridge Regression
ridge_grid <- expand.grid(lambda = seq(0, 0.1, length = 100))  

# tune Ridge Regression
ridge_tune_explore <- train(log(price) ~ ., 
                            data = training_cleaned,
                            method = 'ridge', 
                            preProc = c('scale', 'center'),
                            tuneGrid = ridge_grid,
                            trControl = ctrl)

# plot the RMSE vs. Lambda
plot(ridge_tune_explore)  

# select the best lambda
best_lambda <- ridge_tune_explore$bestTune$lambda
cat("Best lambda:", best_lambda, "\n")
## Best lambda: 0
# retrain Ridge with the best lambda
ridge_tune <- train(log(price) ~ ., 
                    data = training_cleaned,
                    method = 'ridge', 
                    preProc = c('scale', 'center'),
                    tuneGrid = expand.grid(lambda = best_lambda),  
                    trControl = ctrl)

# predict and evaluate
testing_cleaned <- testing |> 
  dplyr::select(-neighbourhood_cleansed)

ridge_pred <- predict(ridge_tune, newdata = testing_cleaned)
ridge_rmse <- sqrt(mean((exp(ridge_pred) - testing$price)^2))
ridge_r2 <- cor(exp(ridge_pred), testing$price)^2

# print results
cat("Ridge Regression RMSE: $", round(ridge_rmse, 2), "\n")
## Ridge Regression RMSE: $ 513.7
cat("Ridge Regression R²: ", round(ridge_r2, 4), "\n")
## Ridge Regression R²:  0.1467

Lasso Regression

# define tuning grid for Lasso Regression
lasso_grid <- expand.grid(alpha = 1,  
                          lambda = seq(0.0001, 1, length = 100))  

# tune Lasso Regression
lasso_tune_explore <- train(log(price) ~ ., 
                            data = training_cleaned,
                            method = 'glmnet',  # Lasso in caret
                            preProc = c('scale', 'center'),
                            tuneGrid = lasso_grid,
                            trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# plot the RMSE vs. Lambda
plot(lasso_tune_explore)  

# select the best lambda
best_lambda <- lasso_tune_explore$bestTune$lambda
cat("Best lambda:", best_lambda, "\n")
## Best lambda: 1e-04
# retrain Lasso with the best lambda
lasso_tune <- train(log(price) ~ ., 
                    data = training_cleaned,
                    method = 'glmnet',
                    preProc = c('scale', 'center'),
                    tuneGrid = expand.grid(alpha = 1, lambda = best_lambda),  
                    trControl = ctrl)

# predict and evaluate
lasso_pred <- predict(lasso_tune, newdata = testing_cleaned)
lasso_rmse <- sqrt(mean((exp(lasso_pred) - testing$price)^2))
lasso_r2 <- cor(exp(lasso_pred), testing$price)^2

# print results
cat("Lasso Regression RMSE: $", round(lasso_rmse, 2), "\n")
## Lasso Regression RMSE: $ 513.49
cat("Lasso Regression R²: ", round(lasso_r2, 4), "\n")
## Lasso Regression R²:  0.1474

Elastic Net

## define tuning grid for Elastic Net (alpha = 0: Ridge, alpha = 1: Lasso)
elastic_grid <- expand.grid(alpha = seq(0, 1, length = 10), 
                             lambda = seq(0.0001, 1, length = 100))  

# tune Elastic Net
elastic_tune_explore <- train(log(price) ~ ., 
                              data = training_cleaned,
                              method = 'glmnet',
                              preProc = c('scale', 'center'),
                              tuneGrid = elastic_grid,
                              trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# plot RMSE vs. alpha & lambda
plot(elastic_tune_explore)

# select the best tuning parameters
best_alpha <- elastic_tune_explore$bestTune$alpha
best_lambda <- elastic_tune_explore$bestTune$lambda
cat("Best alpha:", best_alpha, "\n")
## Best alpha: 0.4444444
cat("Best lambda:", best_lambda, "\n")
## Best lambda: 1e-04
# retrain Elastic Net with the best alpha & lambda
elastic_tune <- train(log(price) ~ ., 
                      data = training_cleaned,
                      method = 'glmnet',
                      preProc = c('scale', 'center'),
                      tuneGrid = expand.grid(alpha = best_alpha, lambda = best_lambda),  
                      trControl = ctrl)

# predict and evaluate
elastic_pred <- predict(elastic_tune, newdata = testing_cleaned)
elastic_rmse <- sqrt(mean((exp(elastic_pred) - testing$price)^2))
elastic_r2 <- cor(exp(elastic_pred), testing$price)^2

# print results
cat("Elastic Net RMSE: $", round(elastic_rmse, 2), "\n")
## Elastic Net RMSE: $ 513.47
cat("Elastic Net R²: ", round(elastic_r2, 4), "\n")
## Elastic Net R²:  0.1475

All three regularization models performed almost identically, and the tuned penalties were essentially zero (Ridge even selected lambda = 0), so each behaves much like ordinary least squares on this data. The choice among them therefore comes down to whether interpretability through feature selection (Lasso) or retaining all predictors (Ridge) is more important for the analysis.

Machine Learning Models

Next I will move on to machine learning models, applying the same ones I used to predict superhost status: Nearest Neighbors, Random Forest, Gradient Boosting, and Neural Networks, this time for a regression task.

# transform price
training$log_price <- log(training$price)
testing$log_price <- log(testing$price)

Nearest Neighbors

# define proper tuning grid for kknn
knn_grid <- expand.grid(
  kmax = seq(5, 30, by = 2),  # Odd values for k (avoid ties)
  distance = 2,  # Euclidean distance (1 = Manhattan, 2 = Euclidean)
  kernel = "optimal"  
)

# train KNN regression with tuning
knnFit <- train(log_price ~ ., 
                data = training,
                method = "kknn", 
                preProcess = c("center", "scale", "nzv"),
                tuneGrid = knn_grid,
                metric = "RMSE",
                trControl = ctrl)

# plot RMSE vs. kmax
plot(knnFit)

# best tuning parameters
best_k <- knnFit$bestTune$kmax
cat("Best k:", best_k, "\n")
## Best k: 29
# retrain using the best k
knnFit_final <- train(log_price ~ ., 
                      data = training,
                      method = "kknn", 
                      preProcess = c("center", "scale", "nzv"),
                      tuneGrid = expand.grid(
                        kmax = best_k, 
                        distance = 2, 
                        kernel = "optimal"
                      ),
                      metric = "RMSE",
                      trControl = ctrl)

# predict and evaluate
knnPred <- exp(predict(knnFit_final, testing))
knn_rmse <- sqrt(mean((knnPred - testing$price)^2))
knn_r2 <- cor(knnPred, testing$price)^2

# print results
cat("KNN Regression RMSE: $", round(knn_rmse, 2), "\n")
## KNN Regression RMSE: $ 336.43
cat("KNN Regression R²: ", round(knn_r2, 4), "\n")
## KNN Regression R²:  0.8859

Random Forest

# define tuning grid for Random Forest
rf_grid <- expand.grid(
  mtry = seq(2, 12, by = 2),  
  splitrule = "variance",  
  min.node.size = c(5, 10, 15, 20)  
)

# train Random Forest model
rfFit_explore <- train(log_price ~ ., 
                       data = training,
                       method = "ranger",   
                       preProcess = c("center", "scale", "nzv"),
                       tuneGrid = rf_grid,
                       metric = "RMSE",
                       trControl = ctrl)

# plot RMSE vs. parameters
plot(rfFit_explore)

# best tuning parameters
best_mtry <- rfFit_explore$bestTune$mtry
best_min_node_size <- rfFit_explore$bestTune$min.node.size
cat("Best mtry:", best_mtry, "\n")
## Best mtry: 12
cat("Best min.node.size:", best_min_node_size, "\n")
## Best min.node.size: 5
# retrain with best parameters
rfFit <- train(log_price ~ ., 
               data = training,
               method = "ranger",   
               preProcess = c("center", "scale", "nzv"),
               tuneGrid = expand.grid(
                 mtry = best_mtry,
                 splitrule = "variance",
                 min.node.size = best_min_node_size
               ),
               metric = "RMSE",
               trControl = ctrl)

# predict and evaluate
rfPred <- exp(predict(rfFit, testing))
rf_rmse <- sqrt(mean((rfPred - testing$price)^2))
rf_r2 <- cor(rfPred, testing$price)^2

# print results
cat("Random Forest RMSE: $", round(rf_rmse, 2), "\n")
## Random Forest RMSE: $ 443.52
cat("Random Forest R²: ", round(rf_r2, 4), "\n")
## Random Forest R²:  0.4339

Gradient Boosting

# Define tuning grid for Gradient Boosting
# NOTE: I ran this model multiple times with progressively more aggressive regularization to try and prevent overfitting
xgb_grid <- expand.grid(
  nrounds = c(50, 100, 200),  # reduce boosting rounds (prevents excessive complexity)
  max_depth = c(2, 4),  # shallower trees force generalization
  eta = c(0.005, 0.01, 0.05),  # slower learning to prevent overfitting 
  gamma = c(5, 10),  # requires more significant splits (reduces unnecessary complexity)
  colsample_bytree = c(0.4, 0.6),  # uses fewer features per tree (adds randomness)
  min_child_weight = c(10, 20),  # forces larger leaf sizes (prevents fitting to small patterns)
  subsample = c(0.4, 0.6)  # uses only part of the data for each tree (prevents over-reliance on full dataset)
)

  # train Gradient Boosting model
suppressWarnings(
  invisible(
    capture.output(
      gbmFit_explore <- train(log_price ~ ., 
                              data = training,
                              method = "xgbTree",   
                              preProcess = c("center", "scale", "nzv"),
                              tuneGrid = xgb_grid,
                              metric = "RMSE",
                              trControl = ctrl,
                              verbose = FALSE)
    )
  )
)

# plot RMSE vs. parameters
plot(gbmFit_explore)

# best tuning parameters
best_params <- gbmFit_explore$bestTune
print("Best Parameters:")
## [1] "Best Parameters:"
print(best_params)
##     nrounds max_depth  eta gamma colsample_bytree min_child_weight subsample
## 210     200         2 0.05     5              0.6               10       0.6
# retrain XGBoost with the best parameters 
suppressWarnings(
  invisible(
    capture.output(
      gbmFit <- train(log_price ~ ., 
                      data = training,
                      method = "xgbTree",   
                      preProcess = c("center", "scale", "nzv"),
                      tuneGrid = best_params,  # Use best parameters
                      metric = "RMSE",
                      trControl = ctrl,
                      verbose = FALSE)
    )
  )
)

# predict and evaluate
gbmPred <- exp(predict(gbmFit, testing))
gbm_rmse <- sqrt(mean((gbmPred - testing$price)^2))
gbm_r2 <- cor(gbmPred, testing$price)^2

# print results
cat("Gradient Boosting RMSE: $", round(gbm_rmse, 2), "\n")
## Gradient Boosting RMSE: $ 391.9
cat("Gradient Boosting R²: ", round(gbm_r2, 4), "\n")
## Gradient Boosting R²:  0.6893

Neural Networks

# define tuning grid for Neural Network
nn_grid <- expand.grid(
  size = c(2, 4, 6, 8, 10),  
  decay = c(0.0001, 0.001, 0.01, 0.1, 0.5)  
)

# train Neural Network Model
nnFit_explore <- train(log_price ~ ., 
                       data = training,
                       method = "nnet",  
                       preProcess = c("center", "scale", "nzv"),
                       trControl = ctrl,
                       tuneGrid = nn_grid,
                       metric = "RMSE",
                       trace = FALSE,  
                       maxit = 500)  
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
# plot RMSE vs. Parameters
plot(nnFit_explore)

# best tuning parameters
best_nn_size <- nnFit_explore$bestTune$size
best_nn_decay <- nnFit_explore$bestTune$decay
cat("Best size:", best_nn_size, "\n")
## Best size: 10
cat("Best decay:", best_nn_decay, "\n")
## Best decay: 1e-04
# retrain with Best Parameters
nnFit <- train(log_price ~ ., 
               data = training,
               method = "nnet",  
               preProcess = c("center", "scale", "nzv"),
               trControl = ctrl,
               tuneGrid = expand.grid(
                 size = best_nn_size,
                 decay = best_nn_decay
               ),
               metric = "RMSE",
               trace = FALSE,  
               maxit = 500)

# predict and evaluate
nnPred <- exp(predict(nnFit, testing))
nn_rmse <- sqrt(mean((nnPred - testing$price)^2))
nn_r2 <- cor(nnPred, testing$price)^2

# print results
cat("Neural Network RMSE: $", round(nn_rmse, 2), "\n")
## Neural Network RMSE: $ 597.77
cat("Neural Network R²: ", round(nn_r2, 4), "\n")
## Neural Network R²:  1e-04

In this task, Nearest Neighbors performs best, which suggests that similarity between listings, particularly in location, is strongly tied to price. Gradient Boosting and Random Forest also performed relatively well. Neural Networks had the worst performance, with a very high RMSE and a near-zero R², confirming once again that neural networks are not well suited to tabular data with a limited number of features.

Comparing Models

Finally, I will compare all of the regression models, including both the regularization and machine learning approaches.

# store model predictions
test_results <- data.frame(
  price = testing$price,
  Ridge = exp(predict(ridge_tune, testing)),
  Lasso = exp(predict(lasso_tune, testing)),
  ElasticNet = exp(predict(elastic_tune, testing)),
  KNN = exp(predict(knnFit_final, testing)),
  RandomForest = exp(predict(rfFit, testing)),
  GradientBoosting = exp(predict(gbmFit, testing)),
  NeuralNetwork = exp(predict(nnFit, testing))
)

# compute RMSE, R², and MAE
model_results <- data.frame(
  Model = colnames(test_results[-1]),
  RMSE = apply(test_results[-1], 2, function(x) sqrt(mean((x - test_results$price)^2))),
  R2 = apply(test_results[-1], 2, function(x) cor(x, test_results$price)^2),
  MAE = apply(test_results[-1], 2, function(x) mean(abs(x - test_results$price)))  # mean absolute error
)

# sort models by RMSE (best model = lowest RMSE)
model_results <- model_results |> arrange(RMSE)
print(model_results)
##                             Model     RMSE           R2       MAE
## KNN                           KNN 336.4258 8.859034e-01  63.57764
## GradientBoosting GradientBoosting 391.9008 6.893061e-01  23.48350
## RandomForest         RandomForest 443.5194 4.338702e-01  14.84255
## ElasticNet             ElasticNet 513.4737 1.474570e-01  52.35895
## Lasso                       Lasso 513.4877 1.474073e-01  52.42760
## Ridge                       Ridge 513.6991 1.467067e-01  52.43182
## NeuralNetwork       NeuralNetwork 597.7673 7.830039e-05 219.32712

As with classification, the ML regression models outperformed the regularization models in prediction, with Nearest Neighbors achieving the lowest RMSE overall. It is worth noting, though, that Random Forest and Gradient Boosting have much lower MAE, suggesting that KNN's edge comes mainly from handling the most extreme prices better.

Ensemble Model

I once again combined my top-performing models into an ensemble, averaging their predictions to balance the strengths and weaknesses of each.

# get predicted prices from individual models (exponentiate if using log-price)
rf_pred <- exp(predict(rfFit, testing))
gbm_pred <- exp(predict(gbmFit, testing))
knn_pred <- exp(predict(knnFit, testing))

# create ensemble prediction by averaging model outputs
ensemble_pred <- (rf_pred + gbm_pred + knn_pred) / 3

# evaluate ensemble model performance
ensemble_rmse <- sqrt(mean((ensemble_pred - testing$price)^2))
ensemble_r2 <- cor(ensemble_pred, testing$price)^2

# print results
cat("Ensemble Model RMSE: $", round(ensemble_rmse, 2), "\n")
## Ensemble Model RMSE: $ 385.42
cat("Ensemble Model R²: ", round(ensemble_r2, 4), "\n")
## Ensemble Model R²:  0.7451

While the ensemble's R² is not quite as high as that of Nearest Neighbors, it is likely more robust, since averaging across models smooths out any single model's errors.
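One possible refinement, sketched here with the test RMSE values reported above, would be to weight the ensemble members inversely to their error rather than averaging them equally. This is an illustration only, not something fit in the analysis:

```r
# weight ensemble members by inverse RMSE instead of a simple average
rmse <- c(rf = 443.52, gbm = 391.90, knn = 336.43)  # test RMSEs reported above
w <- (1 / rmse) / sum(1 / rmse)
round(w, 3)  # the best-performing model (KNN) gets the largest weight

# the weighted prediction would then be
# ensemble_pred <- w["rf"] * rf_pred + w["gbm"] * gbm_pred + w["knn"] * knn_pred
```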

Conclusion

This exercise reinforced key insights about model selection based on the task. Tree-based models (Random Forest, GBM) consistently performed well for both classification and regression, with more computationally expensive models achieving higher accuracy. Neural networks, although powerful, are generally not well suited for the type of tabular data used in this exercise.

Another key takeaway is that statistical methods are generally better for understanding relationships, while machine learning excels at optimizing predictions. However, as descriptive ML techniques become more common, the line between interpretability and accuracy is shifting. Ultimately, the best model depends on the goal, whether prioritizing explainability or maximizing predictive performance, and this analysis demonstrated when each approach is most effective.

Bibliography